A Heuristic-Regression Approach to Crawler Pattern Identification on Clickstream Data

Authors

  • Ronnie Alves
  • Anália Lourenço
Abstract

Web robots, crawlers and spiders are software agents that visit Web sites periodically for multiple purposes. These visits usually generate additional clickstream and pattern data, which raises the need for extra processing and filtering. Robots are not conventional Web users, although some intentionally pretend to be. Their requests flood Web server logs, obscuring real patterns and trends and forcing the application of additional clickstream filtering procedures. To detect the influence of crawlers on usage pattern analysis, we designed a clickstream crawler identification and filtering system, which is described throughout this paper. Based on a set of heuristic approaches, we created a filtering mechanism that yields clean record entries for potential crawler requests. The system generates a confidence factor for each processed record, representing how certain it was when identifying the entry as a crawler request. Afterwards, a multidimensional database is populated with these records together with their confidence factors. This allows regression techniques to be applied in an attempt to establish new and more effective crawler behaviour patterns.
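
The idea of a heuristic filter that attaches a confidence factor to each log record can be illustrated with a minimal sketch. The abstract does not list the heuristics or weights the authors used, so the rules below (a request for `/robots.txt`, a bot-like user agent, a `HEAD` request, a missing referrer), the weights, and the names `LogRecord` and `crawler_confidence` are all assumptions made for illustration only.

```python
from dataclasses import dataclass

# Hypothetical keywords and weights (in percent), chosen for illustration;
# they are not taken from the paper.
BOT_KEYWORDS = ("bot", "crawler", "spider")


@dataclass
class LogRecord:
    user_agent: str
    path: str
    method: str
    referrer: str


def crawler_confidence(record: LogRecord) -> float:
    """Return a score in [0, 1]; higher means more crawler-like."""
    points = 0
    if record.path == "/robots.txt":  # browsers rarely request this file
        points += 40
    if any(k in record.user_agent.lower() for k in BOT_KEYWORDS):
        points += 40  # self-declared robot
    if record.method == "HEAD":  # unusual for interactive browsing
        points += 10
    if not record.referrer:  # no referring page
        points += 10
    return min(points, 100) / 100
```

Scores of this kind could be stored alongside each record and later used as the target variable of a regression model, which is the role the abstract assigns to the confidence factor.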

Similar resources

Applying Clickstream Data Mining to Real-Time Web Crawler Detection and Containment Using ClickTips Platform

The uncontrolled spread of Web crawlers has led to undesired situations of server overload and content misuse. Most programs still have legitimate and useful goals, but standard detection heuristics have not evolved along with Web crawling technology and are now unable to identify most of today’s programs. In this paper, we propose an integrated approach to the problem that ensures the generation ...

A Clickstream-based Focused Trend Parallel Web Crawler

The immense and growing size of the World Wide Web creates many obstacles for all-purpose single-process crawlers, including incorrect answers among search results and scaling drawbacks. As a result, more enhanced heuristics are needed to provide more accurate search outcomes in a timely manner. Regarding the fact that employing link dependent Web page impo...

Evolutionary Biclustering of Clickstream Data

Biclustering is a two way clustering approach involving simultaneous clustering along two dimensions of the data matrix. Finding biclusters of web objects (i.e. web users and web pages) is an emerging topic in the context of web usage mining. It overcomes the problem associated with traditional clustering methods by allowing automatic discovery of browsing pattern based on a subset of attribute...

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download the domain specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed; a key technique among them is focused crawling which is able to crawl particular topical...

An integrated heuristic method based on piecewise regression and cluster analysis for fluctuation data (A case study on health-care: Psoriasis patients)

Trend forecasting and a proper understanding of future changes are necessary for planning in the health-care area. One of the problems of analytic methods is determining the number and location of the breakpoints, especially for fluctuation data. In this area, little research has been published for cases where the number and location of the nodes are not specified. In this paper, a clustering-based method is develo...

Publication date: 2003